Tool-based annotation of out of memory errors. #5196

jmchilton · 2017-12-11T21:19:28Z

Update explicit stdio definitions as well as detect_errors-based definitions to allow this.
- If detect_errors="exit_code" is set, an extra attribute oom_exit_code can be defined on the same command block.
- If detect_errors='aggressive" is set, Galaxy will now look for newer strings indicating common memory problems. (Do we need to restrict this to just newer profile version tools for greater reproducibility or do we want to catch more errors on older tools?).
Update runner code to allow resubmission in such cases (and other tool errors).

Fixes #3320
Implements #3107

AsierGonzalez · 2017-12-12T10:01:24Z

Hi @jmchilton, this sounds incredibly interesting to me as we cannot use HISAT2 on collections with the wheat genome because more than 2 jobs in the same node make them crash silently. The jobs crash due to "out of memory" error but in the history they appear in green, so you do not realise this until downstream tools fail. Would your solution solve this problem?

gregvonkuster · 2017-12-12T12:29:20Z

This is fantastic @jmchilton - it will be extremely beneficial to my work, and I'm looking forward to making use of it. Thanks!

bernt-matthias · 2017-12-14T08:14:22Z

Unfortunately I can't say much to the changes of the code. But from the description and my understanding of the code I definitely think that this PR is a step in the right direction and can be really helpful.

My opinion is still that memory errors should be detected by the runner -- as far as possible (and for the other cases your changes are really useful). My reasoning is the following: like run time violations (which are already handled by the runner) also memory violations are not the tool's fault, but a fault of the runner (which requested to little resources or was just unable to ensure that sufficient resources are available). (Similarly also cluster errors that are mentioned in #3320 should be handled by the runner). A second line of reasoning is that the detection of memory errors should/must work for each and every tool. So if we let the tools decide if there is a memory violation one would need to enable the detection of memory violations in every tool. Because I guess (but maybe I'm wrong) that most tools don't check for memory problems, but use features of the programming language or just don't check at all, this would mean that most tools would need to check for the same error messages which means quite a bit of code duplication in the tool xml files.

Please note that if I write runner, then this seems the best place for me, but I'm not an expert. The important thing for me is that such problems should not be detected by single tools. So some of the job classes might also be a good place.

We have to distinguish programming language specific error messages (e.g. std::bad_alloc in the case of c++ programs) and program specific error messages (e.g. if the programs catch out-of-memory and writing a custom message to stderr/stdout or returning specific error codes).

Handling programming language specific error messages by the runner has the potential problem (as you mentioned earlier) that there might be tools where such an error does not indicate an error. The question with this is: does this affect a significant number of tools? It might be easier to implement an exception for these tools (e.g. by filtering the spurious messages from stdout/stderr) compared to adapting all tools to consider language specific error messages.

Clearly (as discussed earlier), tool specific memory error messages should be handled by the tool. For the detection of such messages your changes appear to go in this direction, but it seems that so far only exit codes can be used to detect OOM? Or is it possible to specify regular expressions for tool specific
memory error messages as well?

So overall my suggestion for this PR is to:

remove the programming language specific error messages from tool checks and move them to the runner (maybe BaseRunner?).
implement regular expression parsing in the tool xml to detect tool specific memory error messages

Maybe a bit of background on how I intend to implement job handling on our Galaxy instance (which uses a grid engine on a large cluster). Jobs are resubmitted with increasing run time starting with 15min and then 1h, 1day, 1week. A similar 2 stage escalation strategy is implemented for memory (we have normal nodes and high mem nodes). So with a few job destinations we cover all tools and input sizes -- without wasting to much run time. This of course relies on the detection of runtime and memory violations. All other strategies, eg. dynamic job destinations or specifying time/memory per tool seem impractical to me because they require to much time for administration or waste resources (the same tool started for bacteria and an eukaryote genome would get the same resources).

jmchilton · 2017-12-14T16:43:32Z

@AsierGonzalez

This tool has custom stdio parsing - so it would need to be updated to catch this - the lines https://github.com/galaxyproject/tools-iuc/blob/master/tools/hisat2/hisat2.xml could become:

<stdio>
       <regex level="fatal" match="hisat2-align exited with value 1" source="both" />
       <regex level="fatal" match="hisat2: not found" source="both" />
       <exit_code range="1:" />
</stdio>

<stdio>
       <regex level="fatal" match="hisat2-align exited with value 1" source="both" />
       <regex level="fatal" match="hisat2: not found" source="both" />
       <regex level="fatal_oom" match="out of memory" source="both" />
       <exit_code range="1:" />
</stdio>

It is weird in general that the HiSat exit code is ignore though - that is a fix that should be made regardless of this change to Galaxy I suspect. I've created an issue on the IUC issue tracker galaxyproject/tools-iuc#1634.

bernt-matthias · 2017-12-14T16:48:53Z

Great. Then the tool based errors seem to be covered by your PR. Still I would suggest that we move the language dependent errors (like std::bad_alloc) elsewhere.

jmchilton · 2017-12-14T18:38:15Z

Great. Then the tool based errors seem to be covered by your PR.

Yeah - I rebased this with one small tweak that enabled this (I thought it already was) and then added some docs for it in galaxy.xsd.

Still I would suggest that we move the language dependent errors (like std::bad_alloc) elsewhere.

These ones are only used if detect_errors="aggressive" has been set on a tool. This flag already mixes common errors from different languages (to some extent at least) and the tool author has sort of declared they want to be really aggressive when searching the output for errors - so I think it is okay to combine languages here - the whole point of the flag I think is to find as many errors as possible. Does that make sense?

I've never really liked aggressive though - I encourage the use of exit_code instead. So I get the desire on some level to want to catch classes of errors based on the programming language I think - though I'm not totally sold. Do you have an idea for what the tool syntax might look like?

Would we make detect_errors a list:

<command detect_errors="exit_code,c_memory_errors">

Or maybe exit_code would be implicit some new detectors that were language based (e.g. c_errors, java_errors):

<command detect_errors="c_errors">

bernt-matthias · 2017-12-18T07:45:53Z

You may leave it as it is.

I'm just looking for a way that a Galaxy admin can enable this behavior (checking for language specific memory errors) for all tools (regardless of the tool configurations).

Maybe job_conf?

bgruening · 2017-12-22T00:51:31Z

If detect_errors='aggressive" is set, Galaxy will now look for newer strings indicating common memory problems. (Do we need to restrict this to just newer profile version tools for greater reproducibility or do we want to catch more errors on older tools?).

I think this is ok. This entire PR looks great to me.
@bernt-matthias I'm with you that this should be ideally handled by an admin with job_conf.xml, but this is a very nice step towards to help certain job schedules to make a good decision. So I don't see any drawbacks here.

bernt-matthias · 2017-12-22T10:33:43Z

I'm completely with you. The PR is a great addition.

On job_conf.xml: Having a <stdio> tag in job_conf with the same functionality as the stdio tag in the tool xml files seems like a great idea to me.

bgruening · 2017-12-22T11:53:52Z

@jmchilton can you please resolve the conflict. Thanks!

- Update explicit stdio definitions as well as detect_errors-based definitions to allow this. - Update runner code to allow resubmission in such cases.

bgruening · 2017-12-27T14:37:19Z

Thanks @jmchilton! I missed the rebase somehow.

What do you think about @bernt-matthias last comment, especially for older tools?

On job_conf.xml: Having a tag in job_conf with the same functionality as the stdio tag in the tool xml files seems like a great idea to me.

@bernt-matthias I see that this could be useful especially for older tools, but I would also encourage Admins to submit this upstream to the tools.

bernt-matthias · 2018-01-03T14:44:57Z

Is there a way to document the new functionality at https://docs.galaxyproject.org/en/master/dev/schema.html?

jmchilton · 2018-01-09T18:43:38Z

This should have done that. Our docs just stopped generating/publishing - the build time grew too large I think. I've increased the timeout and re-ran the last such Jenkins job. https://jenkins.galaxyproject.org/view/All/job/latest-Sphinx-Docs/

nsoranzo · 2018-02-05T16:18:42Z

@jmchilton By looking at the code, it seems to me that StdioParser.parse_error_level() method in lib/galaxy/tools/parser/xml.py should also have been modified to return StdioErrorLevel.FATAL_OOM when err_level matches fatal_oom, or am I missing something?

@nsoranzo

Backend was enhanced to allow this in galaxyproject#5196 and a new entry point to setting exit code for such exit conditions but the custom parsing wasn't updated to allow it. Thanks to @nsoranzo for the catch.

jmchilton added area/jobs area/testing/integration area/tool-framework kind/enhancement labels Dec 11, 2017

jmchilton force-pushed the oom_detect branch from 1f5b9cb to a65c3db Compare December 12, 2017 02:55

galaxybot added this to the 18.01 milestone Dec 12, 2017

jmchilton mentioned this pull request Dec 12, 2017

A new runner for DRMAA (currently UNIVA) #4857

Closed

jmchilton mentioned this pull request Dec 14, 2017

Do not ignore the hisat2 exit code. galaxyproject/tools-iuc#1634

Closed

jmchilton force-pushed the oom_detect branch from a65c3db to bb5db5f Compare December 14, 2017 17:09

Tool-centric annotation of out of memory errors.

f017de3

- Update explicit stdio definitions as well as detect_errors-based definitions to allow this. - Update runner code to allow resubmission in such cases.

jmchilton force-pushed the oom_detect branch from bb5db5f to f017de3 Compare December 22, 2017 14:08

bgruening merged commit c78a23c into galaxyproject:dev Dec 27, 2017

jmchilton mentioned this pull request Feb 5, 2018

[18.01] Fix tool stdio level parsing for OOM level. #5460

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tool-based annotation of out of memory errors. #5196

Tool-based annotation of out of memory errors. #5196

jmchilton commented Dec 11, 2017

AsierGonzalez commented Dec 12, 2017

gregvonkuster commented Dec 12, 2017

bernt-matthias commented Dec 14, 2017

jmchilton commented Dec 14, 2017

bernt-matthias commented Dec 14, 2017

jmchilton commented Dec 14, 2017

bernt-matthias commented Dec 18, 2017

bgruening commented Dec 22, 2017

bernt-matthias commented Dec 22, 2017

bgruening commented Dec 22, 2017

bgruening commented Dec 27, 2017

bernt-matthias commented Jan 3, 2018

jmchilton commented Jan 9, 2018

nsoranzo commented Feb 5, 2018

Tool-based annotation of out of memory errors. #5196

Tool-based annotation of out of memory errors. #5196

Conversation

jmchilton commented Dec 11, 2017

AsierGonzalez commented Dec 12, 2017

gregvonkuster commented Dec 12, 2017

bernt-matthias commented Dec 14, 2017

jmchilton commented Dec 14, 2017

bernt-matthias commented Dec 14, 2017

jmchilton commented Dec 14, 2017

bernt-matthias commented Dec 18, 2017

bgruening commented Dec 22, 2017

bernt-matthias commented Dec 22, 2017

bgruening commented Dec 22, 2017

bgruening commented Dec 27, 2017

bernt-matthias commented Jan 3, 2018

jmchilton commented Jan 9, 2018

nsoranzo commented Feb 5, 2018